Spectrum Full Fine Tuning

NLP
PEFT
Fine Tuning
SNR
Full Fine Tuning using Spectrum (SNR)
Author

Uday

Published

October 30, 2024

Spectrum is a new method for efficiently fine-tuning pre-trained language models: instead of updating every parameter, it fully fine-tunes only a small subset of layers, selected by their SNR (Signal to Noise Ratio), while keeping the rest frozen.

It uses Random Matrix Theory (RMT), specifically the Marchenko-Pastur law, to estimate the SNR of each layer. This law characterizes the eigenvalue distribution of large random matrices.

To understand this better, let’s first review some basics of linear algebra.

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into the product of three matrices: an orthogonal matrix \(U\), a diagonal matrix \(\Sigma\) holding the singular values, and an orthogonal matrix \(V^T\). It is represented as:

\[\begin{align} W = U \Sigma V^T \end{align}\]

When applied to the weight matrix of a neural network, the singular values in \(\Sigma\) measure how strongly the layer acts along each direction: large singular values correspond to dominant (signal) directions, while small ones are indistinguishable from noise.
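As a quick illustration, here is a minimal NumPy sketch (the matrix is random and purely illustrative) of decomposing a matrix with SVD and reconstructing it from the factors:

```python
import numpy as np

# A random matrix standing in for a neural-network weight matrix
W = np.random.randn(6, 4)

# Compact SVD: U is (6, 4), S holds the 4 singular values, Vt is (4, 4)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Multiplying the three factors back together recovers W
W_rec = U @ np.diag(S) @ Vt
print(np.allclose(W, W_rec))  # True
```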

Eigenvalue Decomposition (EVD)

Eigenvalue Decomposition (EVD) is a matrix factorization technique that expresses a matrix in terms of its eigenvectors and eigenvalues. It is similar to SVD but applies only to square matrices; for a symmetric matrix \(W\) it is represented as:

\[\begin{align} W = Q \Lambda Q^T \end{align}\]
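A minimal NumPy sketch of EVD. Note that the form \(W = Q \Lambda Q^T\) with orthogonal \(Q\) holds for symmetric matrices, so the example symmetrizes a random matrix first:

```python
import numpy as np

A = np.random.randn(4, 4)
W = A + A.T  # symmetrize, so W = Q diag(lam) Q^T holds with orthogonal Q

# eigh performs the eigendecomposition of a symmetric matrix
lam, Q = np.linalg.eigh(W)

W_rec = Q @ np.diag(lam) @ Q.T
print(np.allclose(W, W_rec))  # True
```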

Relation between SVD and EVD

SVD of a matrix \(W\) of shape \((m \times n)\) is given as \(W = U \Sigma V^T\)

Let’s look at \(W W^T\):

\[\begin{align} W W^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T \end{align}\]

using \(V^T V = I\), since \(V\) is orthogonal.

\(W W^T\) is a square (symmetric) matrix and can be decomposed using EVD as \(W W^T = Q \Lambda Q^T\).

So \(\Lambda_{W W^T} = \Sigma^2_{W}\): the eigenvalues of \(W W^T\) are the squared singular values of \(W\).
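We can check this relation numerically (illustrative NumPy sketch):

```python
import numpy as np

W = np.random.randn(5, 8)

S = np.linalg.svd(W, compute_uv=False)    # singular values of W, descending
lam = np.linalg.eigvalsh(W @ W.T)[::-1]   # eigenvalues of W W^T, made descending

print(np.allclose(lam, S**2))  # True: eigenvalues of W W^T equal sigma_i^2
```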

Marchenko-Pastur Law

In random matrix theory, the Marchenko-Pastur law describes the asymptotic behavior of the eigenvalues of large random matrices. It states that, for a random matrix \(W\) of shape \((m \times n)\) with \(m \geq n\) and entries of variance \(\sigma^2\), the nonzero eigenvalues of \(\frac{1}{n} W W^T\) converge to a distribution bounded by:

\[\begin{align} \lambda \in \left[\sigma^2 \left(1 - \sqrt{\frac{m}{n}}\right)^2, \sigma^2 \left(1 + \sqrt{\frac{m}{n}}\right)^2\right] \end{align}\]

Taking square roots, the corresponding bound for the singular values of \(\frac{1}{\sqrt{n}} W\) is:

\[\begin{align} \sigma_i \in \left[ \sigma \left|1 - \sqrt{\frac{m}{n}}\right| , \sigma \left(1 + \sqrt{\frac{m}{n}}\right)\right] \end{align}\]
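The sketch below (NumPy, purely illustrative sizes) draws a pure-noise matrix and compares the extreme nonzero eigenvalues of \(\frac{1}{n} W W^T\) with the Marchenko-Pastur bounds:

```python
import numpy as np

m, n, sigma = 2000, 500, 1.0
W = sigma * np.random.randn(m, n)  # pure-noise matrix, entries ~ N(0, sigma^2)

eigs = np.linalg.eigvalsh(W @ W.T / n)
eigs = eigs[eigs > 1e-8]  # W W^T has rank n < m, so drop the m - n zero eigenvalues

lower = sigma**2 * (1 - np.sqrt(m / n)) ** 2
upper = sigma**2 * (1 + np.sqrt(m / n)) ** 2

print(f"empirical range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"MP bounds:       [{lower:.3f}, {upper:.3f}]")
# For large m and n the empirical range hugs the theoretical bounds
```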

Signal to Noise Ratio (SNR)

From the above, the largest singular value a pure-noise matrix is expected to produce is \(\sigma_{\text{thresh}} = \sigma \left(1 + \sqrt{\frac{m}{n}}\right)\). Singular values above this threshold carry signal; those at or below it are consistent with noise.

Let \(S\) be the set of singular values of \(\frac{1}{\sqrt{n}} W\). The SNR is then given by:

\[\begin{align} SNR = \frac{\left|\{ s \in S : s > \sigma_{\text{thresh}} \}\right|}{\left|\{ s \in S : s \leq \sigma_{\text{thresh}} \}\right|} \end{align}\]
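Here is a minimal sketch of this computation, following the count-based definition above. Note that the actual Spectrum implementation differs in details (for example, how \(\sigma\) is estimated from the weights), so treat this as illustrative:

```python
import numpy as np

def snr(W: np.ndarray, sigma: float) -> float:
    """Count-based SNR of a weight matrix, following the definition above."""
    m, n = W.shape
    S = np.linalg.svd(W / np.sqrt(n), compute_uv=False)  # scaled singular values
    thresh = sigma * (1 + np.sqrt(m / n))  # largest singular value expected from noise
    signal = np.sum(S > thresh)   # count of singular values above the threshold
    noise = np.sum(S <= thresh)   # count at or below the threshold
    return signal / noise if noise else float("inf")
```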

Layer Selection

Spectrum computes the SNR for each layer’s weight matrix, ranks the layers by this score, and unfreezes only a chosen fraction with the highest SNR (e.g., the top 25%); all remaining layers stay frozen during fine-tuning. High-SNR layers are the ones whose weights carry the most signal relative to noise, so they are the most informative to update.
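As an illustrative sketch of the selection step (not the official Spectrum tooling; it reuses the hypothetical snr helper above, and the simple sigma estimate here is a crude stand-in for Spectrum’s actual estimator):

```python
import numpy as np

def select_layers(weights: dict[str, np.ndarray], top_percent: float = 25.0) -> list[str]:
    """Rank layers by SNR and return the names of the top fraction.

    `weights` maps layer names to 2-D weight matrices. Sigma is crudely
    estimated as the standard deviation of the entries, purely for illustration.
    """
    scores = {name: snr(W, sigma=W.std()) for name, W in weights.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * top_percent / 100))
    return ranked[:k]  # unfreeze these layers during training; freeze the rest
```

During fine-tuning, only the parameters of the selected layers receive gradient updates; every other layer stays frozen, which is what makes the method memory- and compute-efficient.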

References:

1. Spectrum paper: https://arxiv.org/abs/2406.06623
2. Hugging Face blog: https://huggingface.co/blog/anakin87/spectrum